    Regularized and Smooth Double Core Tensor Factorization for Heterogeneous Data

    We introduce a general tensor model suitable for data-analytic tasks on heterogeneous data sets, wherein there are joint low-rank structures within groups of observations, but also discriminative structures across different groups. To capture such complex structures, a double core tensor (DCOT) factorization model is introduced, together with a family of smoothing loss functions. By leveraging the proposed smoothing functions, the model factors are estimated accurately even in the presence of missing entries. A linearized ADMM method is employed to solve regularized versions of the DCOT factorization, which avoids large tensor operations and large memory storage requirements. Further, we establish its global convergence theoretically, together with consistency of the estimates of the model parameters. The effectiveness of the DCOT model is illustrated on several real-world examples, including image completion, recommender systems, subspace clustering, and detecting modules in heterogeneous Omics multi-modal data, where it provides more insightful decompositions than conventional tensor methods.
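
    As a rough illustration of the two-core idea described in the abstract, the NumPy sketch below builds a Tucker-style reconstruction in which the core is split into a shared core plus a group-specific core, and evaluates the fit only on observed entries. The tensor sizes, ranks, group structure, and missing-data rate are illustrative assumptions; the sketch does not implement the paper's smoothing loss or its linearized ADMM solver.

```python
# Illustrative sketch (not the paper's algorithm): a Tucker-style reconstruction
# whose core is split into a shared core G plus a group-specific core H_g,
# with the fit evaluated only on observed entries (missing-data mask).
import numpy as np

rng = np.random.default_rng(0)

I, J, K = 12, 10, 8          # tensor dimensions (hypothetical toy sizes)
r1, r2, r3 = 3, 3, 2         # multilinear ranks
n_groups = 2                 # observations split into two heterogeneous groups

# Factor matrices shared by all groups.
U1 = rng.standard_normal((I, r1))
U2 = rng.standard_normal((J, r2))
U3 = rng.standard_normal((K, r3))

# Shared ("joint") core plus one discriminative core per group.
G_shared = rng.standard_normal((r1, r2, r3))
H_group = [0.3 * rng.standard_normal((r1, r2, r3)) for _ in range(n_groups)]

def tucker_reconstruct(core, U1, U2, U3):
    """Multilinear product core x_1 U1 x_2 U2 x_3 U3 via einsum."""
    return np.einsum('abc,ia,jb,kc->ijk', core, U1, U2, U3)

def masked_sq_error(X, X_hat, mask):
    """Squared error restricted to observed entries (mask == True)."""
    return np.sum(((X - X_hat) * mask) ** 2)

# Toy data: each group's tensor follows the shared + group-specific structure,
# observed with noise and roughly 30% missing entries.
for g in range(n_groups):
    X_true = tucker_reconstruct(G_shared + H_group[g], U1, U2, U3)
    mask = rng.random((I, J, K)) > 0.3
    X_obs = X_true + 0.05 * rng.standard_normal((I, J, K))

    X_hat_shared = tucker_reconstruct(G_shared, U1, U2, U3)            # ignores group structure
    X_hat_full = tucker_reconstruct(G_shared + H_group[g], U1, U2, U3)  # shared + group core
    print(f"group {g}: shared-core-only error = {masked_sq_error(X_obs, X_hat_shared, mask):.3f}, "
          f"shared+group error = {masked_sq_error(X_obs, X_hat_full, mask):.3f}")
```

    The masked error is noticeably lower when the group-specific core is included, which is the kind of within-group versus across-group structure the DCOT model is designed to separate.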

    Margin Maximization in Attention Mechanism

    The attention mechanism is a central component of the transformer architecture, which has driven the phenomenal success of large language models. However, the theoretical principles underlying attention are poorly understood, especially its nonconvex optimization dynamics. In this work, we explore the seminal softmax-attention model f(X) = ⟨Xv, softmax(XWp)⟩, where X is the token sequence and (v, W, p) are tunable parameters. We prove that running gradient descent on p, or equivalently W, converges in direction to a max-margin solution that separates locally-optimal tokens from non-optimal ones. This clearly formalizes attention as a token-separation mechanism. Remarkably, our results apply to general data and precisely characterize the optimality of tokens in terms of the value embeddings Xv and the problem geometry. We also provide a broader regularization-path analysis that establishes the margin-maximizing nature of attention even for nonlinear prediction heads. When optimizing v and p simultaneously with logistic loss, we identify conditions under which the regularization paths directionally converge to their respective hard-margin SVM solutions, where v separates the input features based on their labels. Interestingly, the SVM formulation of p is influenced by the support-vector geometry of v. Finally, we verify our theoretical findings via numerical experiments and provide insights.
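
    The softmax-attention model above is concrete enough to prototype. The NumPy sketch below implements f(X) = ⟨Xv, softmax(XWp)⟩, derives its gradient with respect to p, and runs gradient descent under a logistic loss on a toy sequence. The data, the choice W = I, the label, step size, and iteration count are assumptions for illustration; the printout only shows how the direction p/‖p‖ and the attention weights evolve, not the paper's formal max-margin guarantee.

```python
# Minimal sketch of the softmax-attention model f(X) = <Xv, softmax(XWp)>,
# with gradient descent run on p alone under a logistic loss.
import numpy as np

rng = np.random.default_rng(1)

n, d = 6, 4                      # n tokens of dimension d (toy sizes)
X = rng.standard_normal((n, d))  # token sequence
v = rng.standard_normal(d)       # fixed value / prediction-head vector
W = np.eye(d)                    # key-query matrix, kept fixed at identity here
p = np.zeros(d)                  # attention parameter optimized by gradient descent
y = 1.0                          # binary label for the single toy sequence

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def f_and_grad_p(X, v, W, p):
    """Model output f(X) = <Xv, softmax(XWp)> and its gradient w.r.t. p."""
    a = X @ v                    # per-token value scores
    s = softmax(X @ W @ p)       # attention weights
    f = a @ s
    # d f / d p = (XW)^T (diag(s) - s s^T) a = (XW)^T (s * (a - f))
    grad = (X @ W).T @ (s * (a - f))
    return f, grad

lr = 0.5
for t in range(2000):
    f, grad_f = f_and_grad_p(X, v, W, p)
    # logistic loss log(1 + exp(-y f)); sigma(-y f) computed stably via logaddexp
    sig = np.exp(-np.logaddexp(0.0, y * f))
    p -= lr * (-y * sig * grad_f)

print("||p|| =", np.linalg.norm(p))
print("normalized direction of p:", np.round(p / np.linalg.norm(p), 3))
print("attention weights softmax(XWp):", np.round(softmax(X @ W @ p), 3))
```

    On runs like this, the norm of p keeps growing while its direction stabilizes and the attention weights concentrate on a small subset of tokens, which is the "convergence in direction to a token-separating solution" behavior the abstract describes.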

    A Penalty Based Method for Communication-Efficient Decentralized Bilevel Programming

    Bilevel programming has recently received attention in the literature due to a wide range of applications, including reinforcement learning and hyper-parameter optimization. However, it is widely assumed that the underlying bilevel optimization problem is solved either by a single machine or by multiple machines connected in a star-shaped network, i.e., the federated learning setting. The latter approach suffers from a high communication cost on the central node (e.g., the parameter server) and exhibits privacy vulnerabilities. Hence, it is of interest to develop methods that solve bilevel optimization problems in a communication-efficient, decentralized manner. To that end, this paper introduces a penalty-function-based decentralized algorithm with theoretical guarantees for this class of optimization problems. Specifically, a distributed alternating gradient-type algorithm for solving consensus bilevel programming over a decentralized network is developed. A key feature of the proposed algorithm is that the hyper-gradient of the penalty function is estimated via decentralized computation of matrix-vector products and a few vector communications; this estimate is then integrated into our alternating algorithm to obtain a finite-time convergence analysis under different convexity assumptions. Owing to the generality of this complexity analysis, our results yield convergence rates for a wide variety of consensus problems, including minimax and compositional optimization. Empirical results on both synthetic and real datasets demonstrate that the proposed method works well in practice.
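
    To make the penalty idea tangible, the sketch below replaces a toy bilevel problem with a single-level objective that penalizes lower-level stationarity, takes alternating gradient steps on the two variable blocks, and mixes iterates across a ring network through a doubly stochastic matrix. The quadratic objectives, penalty weight, topology, and step sizes are assumptions for illustration; this is not the paper's hyper-gradient estimator or its algorithm.

```python
# Schematic sketch of a penalty-based, decentralized treatment of a toy bilevel
# problem: each agent penalizes lower-level stationarity (||grad_y g_i||^2),
# takes alternating gradient steps on (x, y), and averages its iterates with
# ring neighbors via a doubly stochastic mixing matrix.
import numpy as np

rng = np.random.default_rng(2)

M, d = 5, 3                     # number of agents, variable dimension
gamma = 10.0                    # penalty weight on lower-level stationarity
alpha, beta = 0.02, 0.05        # step sizes for x and y

# Local data: agent i has upper-level targets (a_i, b_i) and lower-level map C_i.
A = rng.standard_normal((M, d))
B = rng.standard_normal((M, d))
C = [np.eye(d) + 0.1 * rng.standard_normal((d, d)) for _ in range(M)]

# Doubly stochastic mixing matrix for a ring: each agent averages with 2 neighbors.
Wmix = np.zeros((M, M))
for i in range(M):
    Wmix[i, i] = 0.5
    Wmix[i, (i - 1) % M] = 0.25
    Wmix[i, (i + 1) % M] = 0.25

x = np.zeros((M, d))            # row i is agent i's copy of the upper-level variable
y = np.zeros((M, d))            # row i is agent i's copy of the lower-level variable

def grads(i, xi, yi):
    """Gradients of the penalized local objective
    f_i(x, y) + (gamma/2) * ||grad_y g_i(x, y)||^2, with
    f_i = 0.5||x - a_i||^2 + 0.5||y - b_i||^2 and g_i = 0.5||y - C_i x||^2."""
    r = yi - C[i] @ xi                           # grad_y g_i
    gx = (xi - A[i]) - gamma * (C[i].T @ r)
    gy = (yi - B[i]) + gamma * r
    return gx, gy

for t in range(500):
    x, y = Wmix @ x, Wmix @ y                    # consensus (gossip) averaging step
    for i in range(M):
        _, gy = grads(i, x[i], y[i])
        y[i] -= beta * gy                        # lower-level (y) gradient step
        gx, _ = grads(i, x[i], y[i])
        x[i] -= alpha * gx                       # upper-level (x) step at the updated y

consensus_gap = np.linalg.norm(x - x.mean(axis=0))
print("consensus gap across agents:", round(float(consensus_gap), 6))
print("averaged x:", np.round(x.mean(axis=0), 3))
```

    The mixing step is what keeps communication purely neighbor-to-neighbor (no central server), while the penalty term is one standard way to fold the lower-level problem into a single objective that alternating gradient steps can handle.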